Neural-network decoders for measurement induced phase transitions

Open quantum systems have been shown to host a plethora of exotic dynamical phases. Measurement-induced entanglement phase transitions in monitored quantum systems are a striking example of this phenomenon. However, naive realizations of such phase transitions require an exponential number of repetitions of the experiment, which is infeasible in practice on large systems. Recently, it has been proposed that these phase transitions can be probed locally by entangling reference qubits with the system and studying their purification dynamics. In this work, we leverage modern machine learning tools to devise a neural network decoder that determines the state of the reference qubits conditioned on the measurement outcomes. We show that the entanglement phase transition manifests itself as a stark change in the learnability of the decoder function. We study the complexity and scalability of this approach in both Clifford and Haar random circuits and discuss how it can be utilized to detect entanglement phase transitions in generic experiments.

Entanglement entropy in closed quantum systems that thermalize generally tends to increase until reaching a volume-law behavior with entanglement spread throughout the system 1,2 . Coupling to a bath profoundly changes the internal evolution of the system 3 , which in turn can suppress the growth of entanglement and correlations within the system to an area-law behavior 4,5 . A prominent example of such systems is random quantum circuits with intermediate measurements [6][7][8][9][10] . In these circuits, where the unitary time evolution of the system is interspersed by quantum measurements, the competition between unitary and non-unitary elements leads to a measurement-induced phase transition (MIPT) between a pure phase with an area-law and a mixed phase with a volume-law entanglement behavior.
Such entanglement phase transitions are only accessible when the density matrix is conditioned on the measurement outcomes while they are hidden from any observable which can be expressed as a linear function of the density matrix. On the other hand, to experimentally probe observables which are non-linear functions of the density matrix, one naively needs to reproduce multiple copies of the same state. However, due to intrinsic randomness in measurement outcomes, this naive approach requires repeating the experiment exponentially many times (in system size) 10,26 .
Building on the close connection between measurement-induced entanglement phase transitions and quantum error correction 11,12,[34][35][36][37] , a possible workaround to this obstacle was found in ref. 38 for purification transitions, which generically coincide with area-to-volume-law entanglement transitions in random circuit models without symmetry or topological order 27 . It was shown how to probe these phase transitions through the purification dynamics of an ancilla reference qubit that is initially entangled with local system degrees of freedom. The time dependence of the entanglement entropy of the reference qubit then reveals the properties of the phase transition 11,15,38 . To employ this method, one needs to find the density matrix of the reference qubit conditioned on the measurement outcomes of the circuit. Hence, the final objective of this approach is to obtain a "decoder" that maps the measurement outcomes to the density matrix of the reference qubit. However, such decoders are only known and implemented for special classes of circuits such as stabilizer circuits 9 . For more generic circuits, such as Haar-random circuits, finding an analytical solution to this problem is likely infeasible.
Here, motivated by the recent successful applications of machine learning algorithms in quantum sciences 39 and especially optimizing quantum error correction codes and quantum decoders [40][41][42][43][44][45][46][47][48] , we provide a generic neural network (NN) approach that can efficiently find the aforementioned decoders. First, we sketch our physically motivated NN architecture. Although we use numerical simulations of Clifford circuits to show the efficacy of our NN decoder, we argue that in principle the same decoder with slight modifications should work for any generic circuit. We investigate the complexity of our learning task by studying the number of circuit runs required for training the neural network decoder. Importantly, we show that the learning task only needs measurement outcomes inside a rectangle encompassing the statistical light-cone 19,38 of the reference qubit. Furthermore, we demonstrate that by studying the temporal behavior of the learnability of the quantum trajectories, one can estimate the critical properties of the phase transition. We also verify that for large circuits one can train the NN over smaller circuits which provides evidence for the scalability of our method. Finally, we explain how our method can be applied to generic circuits with Haar random gates, and we study the temporal behavior of the averaged entanglement entropy for two values of measurement rate in the area-law and volume-law phases for a small ensemble of such circuits.

Model
The circuits that we study have a brickwork structure, as in Fig. 1, with $L$ qubits. We consider time evolution over $T$ time steps with repeated layers of two-qubit random unitary gates, each followed by a round of single-site measurements of the Pauli $Z$ operator performed at each site with probability $p$. As one tunes $p$ past a critical value $p_c$, there is a phase transition from volume-law entanglement ($p < p_c$) to area-law entanglement ($p > p_c$), with logarithmic scaling at the critical point ($p = p_c$). Crucially for this work, this phase transition is also manifested in the time dependence of the entanglement entropy $S_Q(t)$ of a reference qubit entangled with the system 38 . $S_Q(T)$, averaged over many circuit runs, is known as the coherent quantum information and plays a crucial role in the fundamental theory of quantum error correction 49 . For circuit depths that are polynomial in the system size, $S_Q(T)$ maintains a finite value in the volume-law phase and vanishes in the area-law phase. The protocol we use to probe $S_Q(t)$ is illustrated in Fig. 1a. Starting from a pure product state, we make a Bell pair out of the qubit in the middle and an ancilla reference qubit. Throughout the paper, we use periodic boundary conditions for the circuit.

Decoder
To find $S_Q(T)$ in the experiment, we need to find the density matrix of the reference qubit at time $T$, which is a vector inside the Bloch sphere and can be specified by its three components $\langle\sigma_X\rangle$, $\langle\sigma_Y\rangle$ and $\langle\sigma_Z\rangle$. Therefore, probing the phase transition can be viewed as the task of finding a decoder function $F_C$ for a given circuit $C$, such that
$$F_C(M_T) = \left(\langle\sigma_X\rangle, \langle\sigma_Y\rangle, \langle\sigma_Z\rangle\right), \tag{1}$$
where $M_T$ is the set of circuit measurement outcomes. Let $p_P(m|M_T)$ for $P \in \{X, Y, Z\}$ denote the probability of getting reference qubit outcome $m = \pm 1$ when measuring $\sigma_P$ of the reference qubit after time $t = T$, conditioned on the measurement outcomes $M_T$. Since $\langle\sigma_P\rangle = \sum_{m=\pm 1} m\, p_P(m|M_T)$, the problem of finding the decoder $F_C$ is equivalent to finding the probability distributions $p_P(m|M_T)$ for $P \in \{X, Y, Z\}$.
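The relation between the conditional probabilities, the Bloch-vector components, and the reference-qubit entropy can be illustrated with a minimal sketch. The function names `bloch_component` and `qubit_entropy` are hypothetical and introduced only for illustration; the entropy formula is the standard single-qubit result, with eigenvalues $(1 \pm |r|)/2$ for a Bloch vector $r$.

```python
import math

def bloch_component(p_plus):
    """<sigma_P> = sum over m = +1, -1 of m * p_P(m | M_T).

    p_plus is the conditional probability of the outcome m = +1
    (hypothetical input; in practice it would come from a decoder).
    """
    return (+1) * p_plus + (-1) * (1.0 - p_plus)

def qubit_entropy(bloch):
    """Von Neumann entropy, in bits, of a qubit with the given Bloch vector."""
    # Clip the length at 1 so that small numerical overshoots stay physical.
    r = min(1.0, math.sqrt(sum(x * x for x in bloch)))
    entropy = 0.0
    for eig in ((1.0 + r) / 2.0, (1.0 - r) / 2.0):
        if eig > 0.0:
            entropy -= eig * math.log2(eig)
    return entropy
```

A fully mixed reference qubit (zero Bloch vector) gives one bit of entropy, while a qubit purified along any axis (unit Bloch vector) gives zero.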

Deep learning algorithm
Instead of finding $p_P(m|M_T)$ analytically for a given circuit $C$, we use ML methods to learn these functions from a set of sampled data points, which in principle could be obtained from experiments. The task of learning conditional probability distributions is known as probabilistic classification in the ML literature 50,51 . Let us fix the circuit $C$ and the Pauli $P$. A sample data point is a pair $(M_T, m)$ for a single run of the circuit, where $M_T$ is the set of circuit measurement outcomes and $m$ is the outcome of measuring the reference qubit in the $\sigma_P$ basis at the end of the circuit. By repeating the experiment $N_t$ times, we can generate a training set of $N_t$ data points. By training a neural network on this data set, we obtain a neural network representation of the function $p_P(m|M_T)$.
Framing the problem as a probabilistic classification task does not necessarily mean that the learning task is efficient. Indeed, given that the number of different possible outcomes $M_T$ scales exponentially with the system size, one would naively expect that the minimum required $N_t$ should also scale exponentially for the learning task to succeed, i.e., we would need to run the circuit an exponential number of times to generate the required training data set. However, the crucial point made in ref. 38 is that, when the reference qubit is initially entangled locally with the system, its density matrix at the end of the circuit depends only on the measurement outcomes that lie inside a statistical light cone, and only up to a depth bounded by the correlation time, which remains finite as the system size grows away from the critical point. Hence, for a typical circuit away from the critical point, the function $p_P(m|M_T)$ depends only on a finite number of elements in $M_T$, which makes the learning task feasible.
To show the effectiveness of this method, we test our decoder using data points gathered from numerical simulation of Clifford circuits with $p_c = 0.160(1)$ 8 , which enables us to study circuits of large enough sizes. Due to Clifford dynamics, the reference qubit either remains completely mixed at $t = T$ or is purified along one of the Pauli axes. This means the measurement outcome of $\sigma_P$ at the end of the circuit is either deterministic or completely random. Therefore, it is more natural to view the problem as a hard classification task (rather than a probabilistic one) where we train the neural network to determine the measurement outcome of $\sigma_P$ (see the "Methods" section). Note that if the reference qubit is purified at the end of the circuit, then the decoder can in principle learn the decoding function, while if it is not, the measurement outcomes are completely random, leading to an inevitable failure of the hard classification. Thus, the purification phase transition shows itself as a learnability phase transition. It is worth noting that we are only changing how we interpret the output of the NN, i.e., we pick the label with the highest probability, so the same NN architecture can be used for more generic gate sets. For simplicity, we also only look at the data points corresponding to the basis $P$ in which the reference qubit is purified. In an experiment, the purification axis is not known, so one needs to train the NN for each of the three choices of $P$; if the learning task fails for all of them, the qubit is totally mixed. Otherwise, the learning task will succeed for one axis and fail for the other two (note that, for a fixed Clifford circuit, the purification axis does not depend on $M_T$), which means the reference qubit is purified.
Fig. 1 Circuits and the decoding protocol using neural networks. a Brickwall structure of a hybrid circuit with random two-qubit Clifford gates interspersed with projective $Z$ measurements and with periodic boundary conditions. $M_T$ denotes the measurement outcome matrix with matrix elements $m_i \in \{0, \pm 1\}$ ($m_i = 0$ when the corresponding qubit is not measured, and $m_i = \pm 1$ when a qubit's Pauli $Z$ is measured). Here, $T = 3$ for this example. b Neural network architecture: We use convolutional neural networks composed of C: convolutional, P: pooling, and F: fully connected layers, trained on quantum trajectories. The neural network implements a decoder function that predicts the measurement result for the reference qubit $\sigma_P$ using the measurement record in the circuit $M_T$ as input.
Since locality plays an important role in purification dynamics, we employ a particular deep learning 52-54 architecture called convolutional neural networks (CNN) that are efficient in detecting local features in image recognition applications 55 . In utilizing these networks the input data is treated as a snapshot as in Fig. 1b with each pixel treated as a feature of the NN and the label of each image is the measurement outcome of σ P .
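As a minimal sketch of how a measurement record could be turned into the image-like input described here, the function below builds the $T \times L$ matrix $M_T$ with entries 0 (site not measured) and ±1 (measured Pauli $Z$ outcome). The event format `(layer, site, outcome)` and the name `measurement_matrix` are assumptions made for illustration, not the authors' data format.

```python
def measurement_matrix(events, T, L):
    """Build the T x L measurement-outcome matrix M_T.

    events: iterable of (layer, site, outcome) tuples, with outcome in {+1, -1}.
    Entries stay 0 wherever no measurement was performed.
    """
    M = [[0] * L for _ in range(T)]
    for layer, site, outcome in events:
        if outcome not in (+1, -1):
            raise ValueError("outcome must be +1 or -1")
        M[layer][site] = outcome
    return M
```

Each row of the returned matrix corresponds to one circuit layer and each column to one qubit, so every entry plays the role of a pixel of the input snapshot.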

Learning complexity
For a fixed circuit C, we start the training procedure by training the NN with a given number of labeled quantum trajectory measurements and then evaluate its performance in predicting the labels of new randomly generated trajectories produced by the same circuit C. The learning accuracy 1−ϵ l is the probability that the NN predicts the right label. The minimum number of training samples denoted by M(ϵ l ) to reach a specified learning error ϵ l can provide an empirical measure of the learning complexity of the decoder function F C 56 . In what follows, we fix the learning error of each circuit to be ϵ l = 0.02.
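The empirical complexity measure can be sketched as follows, assuming one has already recorded the test accuracy of the trained network for several training-set sizes. The input format and the name `min_training_samples` are hypothetical.

```python
def min_training_samples(curve, eps_l=0.02):
    """Smallest training-set size whose test accuracy reaches 1 - eps_l.

    curve: list of (n_train, accuracy) pairs measured for one circuit.
    Returns None if the target accuracy is never reached.
    """
    target = 1.0 - eps_l
    for n_train, accuracy in sorted(curve):
        if accuracy >= target:
            return n_train
    return None
```

A circuit for which the function returns `None` at the largest available training-set size would be counted as not learned.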
In performing this analysis, different learning settings can be considered. Intuitively, for a fixed circuit, we expect the purification time of the reference qubit, t p , after which the reference qubit's state does not alter any further, to play an important role in determining M. Therefore, in our first learning setup, we consider a conditional learning scheme where for a given measurement rate, we select quantum circuits based on their purification time t p , which allows us to study the effect of the system size on the learning complexity. Moreover, we discard measurement outcomes corresponding to measurements performed after t p . This is to say that for each t p , measurement outcomes outside a mask with width L and height t p will be masked. Here, we note that given N c circuits with the same purification time, in addition to the learning efficiency of each circuit, we need to fix the learning inaccuracy averaged over N c circuits, δ l , which we fix to be δ l = 20%. We remark that this number is larger than ϵ l since some of the conditionally selected circuits have not been learned.
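The masking step of the conditional scheme can be illustrated with a short sketch. It assumes the record is stored as a list of rows, one per layer, with row index 0 being the first layer; the name `mask_after_purification` is hypothetical.

```python
def mask_after_purification(M, t_p):
    """Keep the first t_p layers of the record and zero out all later layers.

    M: list of rows (one per layer), entries in {0, +1, -1}.
    The input is left unmodified; a masked copy is returned.
    """
    return [row[:] if t < t_p else [0] * len(row) for t, row in enumerate(M)]
```

The retained region is the mask of width $L$ and height $t_p$ described above; masked entries are indistinguishable from unmeasured sites.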
In the second setting, we remove the conditioning constraint and only consider the overall complexity of the learning task when we randomly generate circuits for a given $p$ in a completely unconditional manner. The two schemes can be related using the probability distribution $r_p$ of the purification time, as shown in Fig. 2a and explained more concretely in the "Methods" section. We should emphasize that the conditional learning scheme is only a tool for studying the complexity of the learning problem for Clifford circuits. For probing the phases and phase transitions in both Clifford and Haar circuits, we use the unconditional learning scheme. Note that since the reference qubit is entangled locally at the beginning, there is always a finite probability that it will be purified at early times. In the mixed phase, the distribution has an exponentially small tail until exponentially long times (both in system size) whereas, in the pure phase, the ancilla purifies in a constant time independent of system size. Inspired by the approximate locality structure of hybrid circuits 38 , we also consider a light-cone learning scheme, where we train the NN using only the measurement outcomes inside a box centered in the middle (see below). In Fig. 2b, we compare the complexity of the conditional learning task in the pure and mixed phases, using both the light-cone box (main) and whole-circuit (inset) measurement data. For each purification time and $p$, we consider $N_c = 20$ different circuits, average over their minimum required training numbers to calculate $M(\epsilon_l, \delta_l)$, and show the standard deviation as the error bar. Here, for all the curves, we observe an approximately exponential growth of $M(\epsilon_l, \delta_l)$ as a function of the purification time $t_p$.
By comparing the mixed and pure phases, we notice that the conditional learning task is more complicated in the pure phase than in the mixed phase, which is expected since, all else being equal, there are more measurements in the pure phase. Additionally, as shown in the inset, we find that learning with light-cone data is less complicated than using all the measurement outcomes. These behaviors can be understood by recognizing that to learn the decoder we need to explore the domain of the mapping in Eq. (1), whose size scales exponentially as $2^{pTL}$.
In Fig. 2c we compare the system size dependence of the complexity in the two phases for $L = 64, 128$, where we train our networks with the light-cone data. We note that since the size of the light-cone box for a fixed $t_p$ is independent of the system size, we expect the asymptotic complexity to be independent of the system size. Our numerical observation is partially in agreement with this theoretical expectation. This point is studied further in the "Methods" section, where we explicitly depict the system size dependence of the complexity for circuits with experimentally relevant system sizes $L = \{16, 32, 64, 128\}$. In the "Methods" section, we also obtain similar complexity results for circuits with initial states scrambled by a high-depth random Clifford circuit.
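To make the counting argument concrete, the sketch below compares the typical size of the decoder's input domain for the whole circuit and for a narrower light-cone box. It assumes the typical number of measured sites is $p \times \text{depth} \times \text{width}$, each with two possible outcomes; the function names and the box width used in the test values are illustrative only.

```python
def log2_record_count(p, depth, width):
    """log2 of the typical number of distinct measurement records.

    Roughly p * depth * width sites are measured, each with two outcomes,
    so the count scales as 2 ** (p * depth * width).
    """
    return p * depth * width

def lightcone_saving(p, depth, L, box_width):
    """Reduction in log2 domain size when only a box of box_width sites is used."""
    return log2_record_count(p, depth, L) - log2_record_count(p, depth, box_width)
```

For a fixed depth, the log-domain size grows linearly with $p$ and with the width of the region whose outcomes are fed to the network, which is consistent with the pure phase (larger $p$) and the whole-circuit data being harder to learn.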
In the final step, we consider the unconditional learning task. Figure 2d shows the ratio of circuits that can be learned, denoted by R l , as a function of N t , with the circuit depth fixed at T = 10.
After an initial fast growth in $R_l$, the learning procedure slows down. This can be understood by noting that exponentially more samples are required to learn the decoder for circuits with longer purification time. Moreover, the saturation value for each $p$ is bounded by the ratio of circuits that are purified by time $T$, which can be expressed as
$$R_p(T) = \sum_{t_p \le T} r_p(t_p), \tag{2}$$
where $r_p$ is the purification rate plotted in Fig. 2a.
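The bound on the saturation value is the cumulative weight of the purification-time distribution up to the circuit depth. A minimal sketch, assuming the distribution is supplied as a dictionary from purification time to probability (a hypothetical format):

```python
def purified_fraction(r_p, T):
    """Fraction of circuits purified by time T.

    r_p: dict mapping purification time t_p to its probability.
    Returns the sum of r_p[t_p] over all t_p <= T.
    """
    return sum(prob for t_p, prob in r_p.items() if t_p <= T)
```

Circuits whose purification time exceeds $T$ contribute nothing, so the fraction of learnable circuits cannot exceed this value regardless of the number of training samples.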

Dynamics of coherent information
We can utilize the NN decoder to study the critical properties of the phase transition. For a fixed circuit configuration $c$ with a given $p$, let $\rho_c$ and $s_c(t) = -\mathrm{tr}(\rho_c \log_2 \rho_c)$ denote the density matrix and von Neumann entropy of the reference qubit after time $t$, respectively. Based on this definition, we let $S_Q(t) = -\frac{1}{N_c}\sum_{c=1}^{N_c} \mathrm{tr}(\rho_c \log_2 \rho_c)$ denote the average entropy of the reference qubit after time $t$, i.e., the coherent quantum information of the system with one encoded qubit. We may assume on general grounds that $S_Q(t)$ follows an early-time exponential decay $e^{-\lambda t}$, with $\lambda$ following the scaling form
$$\lambda = L^{-z} f\left[(p - p_c) L^{z/\nu}\right], \tag{3}$$
where $z$ and $\nu$ are the dynamical and correlation length critical exponents, respectively 9 . In stabilizer circuits, the density matrix of the reference qubit will either be purified completely, with $s_c = 0$, or be in a totally mixed state, with $s_c = 1$. Since $S_Q(t)$ and the ratio of purified circuits $R_p(t)$ are related by $S_Q = 1 - R_p$, we can estimate $S_Q(t)$ by the ratio of learnable circuits of depth $t$ in the unconditional scheme described above. We denote the estimated value of $S_Q(t)$ from learning by $\tilde{S}_Q$. More concretely: (1) for each given $p$ and $L$ we generate $N_c = 10^3$ random circuits, evolve them for $T \sim O(10)$ time steps (a depth that does not scale with the system size), and record the measurement outcomes $M_T$; (2) at the end of this time evolution, we measure the spin of the reference qubit along the purification axis, $m$; (3) for each circuit we use the corresponding labeled data $(M_T, m)$ to train our neural network to make future predictions. We note that since there is no constraint on generating the circuits and their quantum trajectories in this approach, the procedure can be directly applied to experimental data without requiring any post-selection or conditioning.
In Fig. 3a we compare the temporal behavior of the coherent information obtained from an ideal decoder and from the NN decoder introduced here, where for each $p$ we consider $N_c = 10^3$ different circuit configurations. As demonstrated in Fig. 3a, in the mixed phase the learned entanglement entropy closely follows the simulated entanglement entropy, while in the pure phase the two curves start to deviate from each other after a few time steps. This behavior is consistent with the observations in Fig. 2, where we demonstrated that the learning task is easier in the mixed phase. Since at the critical point this phase transition can be described by a 1 + 1-D conformal field theory 6,8 , the dynamical critical exponent can be fixed in advance to $z = 1$, and correspondingly we define the scaled time $\tau = t/L$. Furthermore, since the argument of the scaling function $f$ on the right-hand side of Eq. (3) becomes independent of $L$ at $p_c$, we expect to see a crossing in the decay rate in scaled time, $\lambda_{\tau_d}$, when it is plotted for different system sizes. Here, $\tau_d = t_d/L$ is the differentiation time, which should be sufficiently large. In Fig. 3b, we evaluate the decay rate obtained by learning, $\tilde{\lambda}_{\tau_d}$, for three different system sizes, $L = \{32, 48, 64\}$, at $\tau_d = 1/16$ using $\tilde{S}_Q$. The corresponding times are $t_d = \{2, 3, 4\}$, for which the deviation between the learned and simulated coherent information is negligible. Here, we notice an approximate crossing in the region $0.1 \lesssim p_c \lesssim 0.15$, signaling a phase transition in this region. More systematically, we may find the best-estimated values of the critical data by collapsing the decay rate curves according to the scaling ansatz in Eq. (3). In particular, after fixing $z = 1$, we can search simultaneously for $p_c$ and $\nu$ so that the fitting error of the regression curve is minimized (see the "Methods" section). The inverse error is plotted as a function of $p_c$ and $\nu$ in Fig. 3c, where we observe that the lowest error corresponds to the region $p_c \simeq 0.13$, $\nu \simeq 1.5$.
Similarly, we can examine our assumption about the conformal symmetry of the transition by fixing $p_c = 0.13$ and allowing $\nu$ and $z$ to vary, as in Fig. 3d. Here, we observe that the lowest error corresponds to the region around $\nu \simeq 1.5$, $z \simeq 1$. Using the obtained estimates, namely $\nu \simeq 1.5$, $z \simeq 1$, and $p_c \simeq 0.13$, in the inset of Fig. 3b we collapse the three curves of $L^z \tilde{\lambda}$ as a function of $\tilde{p} = (p - p_c) L^{z/\nu}$. In the "Methods" section, we search simultaneously over all three parameters and find that the best estimates for the critical data are in the region $p_c = 0.14 \pm 0.03$, $z = 0.9 \pm 0.15$, and $\nu = 1.5 \pm 0.3$. Once the error margins are considered, these results are consistent with the results obtained from the half-chain entanglement entropy, $z = 1$, $p_c \simeq 0.16$, and $\nu \simeq 1.3$ 6,8 . However, in order to differentiate this phase transition from the percolation phase transition 57 , more precise results for the critical exponents are required. Additionally, we verify our learning results by comparing them with the results obtained from exact simulations of $S_Q(t)$, where we demonstrate that by increasing $L$, $t_d$, and $N_c$, the phase transition parameters can be determined more accurately.
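For stabilizer circuits, the estimate of the coherent information from learning reduces to counting learnable circuits. A minimal sketch, assuming a list of per-circuit flags that record whether the decoder reached the target accuracy at a given depth (the flag format and function name are hypothetical):

```python
def estimated_coherent_information(learned_flags):
    """Estimate S_Q as 1 - R_l, with R_l the fraction of learnable circuits.

    learned_flags: one boolean per circuit, True if the decoder for that
    circuit reached the target accuracy at the depth of interest.
    """
    if not learned_flags:
        raise ValueError("need at least one circuit")
    r_l = sum(1 for flag in learned_flags if flag) / len(learned_flags)
    return 1.0 - r_l
```

Repeating this for each depth $t$ gives a learned time series that can be compared with the exactly simulated $S_Q(t)$.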

Scalability of learning
An important feature of a practical decoder is the possibility of training it on small circuits and then utilizing it for decoding larger circuits. Here, due to the approximate locality of the temporal evolution of the random hybrid circuits, one can examine the scalability of the decoders in a concrete manner. For a given circuit with $L$ qubits, we generate smaller circuits with $L_B < L$ qubits whose gates are identical to those of the original circuit within a narrow rectangular strip around the middle qubit, which is entangled with the reference qubit. The geometry of the two sets of circuits is displayed in Fig. 4a, where the depth of the two sets of circuits is chosen to be equal. Here, for each $p$ we generate $N_c$ large circuits with $L = \{32, 64\}$ and $T = 10$ time steps. We also only consider those circuits that are learnable using measurement outcomes from the original circuit. Next, for each of these circuits and for $L_B = \{4, 8, \cdots, 20\}$, we generate the corresponding smaller circuits and run them to generate $N_t = 5 \times 10^3$ quantum trajectories. In the training step, we use the quantum trajectories produced from the smaller circuits to train our neural networks. In the testing step, however, we use these neural networks to make predictions for the quantum trajectories obtained from the larger circuits. As we observe in Fig. 4b, increasing $L_B$ increases the ratio of circuits that can be learned by the smaller circuits' NNs. Also, consistent with the effective light-cone picture, we see that for both system sizes, $L = \{32, 64\}$, the $L_B$ required to reach almost full efficiency is set by the light-cone condition $L_B \gtrsim 2T$, which in our case corresponds to $L_B = 20$. This demonstration provides evidence that, independent of the system size, the light-cone-trained NNs can be used for learning larger circuits.
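The geometry of the strip can be sketched as follows. The code assumes qubits are indexed from 0 to $L-1$ with the reference-coupled qubit at index $L/2$, and that the strip is centered on it with periodic wrapping; these indexing conventions and the function names are illustrative assumptions, not taken from the original implementation.

```python
def strip_sites(L, L_B):
    """Sites of the large circuit covered by a strip of L_B qubits around the middle."""
    if not 0 < L_B <= L:
        raise ValueError("need 0 < L_B <= L")
    middle = L // 2
    start = middle - L_B // 2
    # Wrap around to respect periodic boundary conditions.
    return [(start + k) % L for k in range(L_B)]

def lightcone_width(T):
    """Strip width suggested by the light-cone condition L_B >~ 2T."""
    return 2 * T
```

For $T = 10$ the suggested width is 20 qubits, matching the largest $L_B$ considered in the text for both $L = 32$ and $L = 64$.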

Generalization to Haar random circuits
To benchmark the methods, we have focused on Clifford circuits, which have two important simplifications for our learning procedure. First, the purification axis is independent of the measurement outcomes and the learning only needs to be performed along one of the $\{X, Y, Z\}$ axes in the Bloch sphere. In addition, the purification occurs at a specific layer of the circuit. Therefore, it is important to test our results in more generic Haar random circuits, where the purification axis can be along any radius in the Bloch sphere and purification dynamics occurs throughout the circuit evolution 15 . Here, we show how to adapt our method to Haar random circuits to see clear evidence of the two phases. We leave the study of critical properties of the entanglement phase transition with our method for future work.
To obtain the decoder function $F_C$ for generic circuits, we need to create three independent sets of labeled data, obtained from quantum trajectories, for measuring $\sigma_i$ with $i \in \{X, Y, Z\}$. Next, these three sets of labeled measurement data, represented by $\{M_T^i, m_i\}$, are used to train three independent neural networks to produce the probability distributions $p_i(m|M_T)$ that determine the expectation values of the reference qubit density matrix. Consequently, given new quantum trajectories, the trained $p_i$'s are employed to estimate $\langle\sigma_i\rangle$. Finally, using standard density matrix tomography methods, such as the maximum likelihood estimation of the density matrix of a single qubit 58 , we can obtain the most likely physical density matrix associated with the predicted $\langle\sigma_i\rangle$'s. An illustrative example of the learning dynamics in the two phases for a small number of circuits is shown in Fig. 5, where we study $S_Q(t)$ and its learned value as a function of time for a circuit with $L = 8$ qubits in the two phases ($p_c \approx 0.17$ for this model 15 ). We see from this example that our NN decoder straightforwardly generalizes to generic quantum circuits, and with a larger ensemble of circuits and quantum trajectories it should be possible to study the phase transition properties.
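The last step, turning three predicted expectation values into a physical single-qubit density matrix, can be sketched as below. Note that this sketch uses a simple projection of the Bloch vector onto the unit ball as a stand-in for the maximum likelihood estimation cited in the text; it is not that method, and the function names are hypothetical.

```python
import math

def physical_bloch_vector(raw):
    """Rescale a predicted Bloch vector to length 1 if it lies outside the Bloch ball."""
    norm = math.sqrt(sum(x * x for x in raw))
    if norm <= 1.0:
        return tuple(raw)
    return tuple(x / norm for x in raw)

def density_matrix(bloch):
    """rho = (I + x*sigma_X + y*sigma_Y + z*sigma_Z) / 2 as a nested 2 x 2 list."""
    x, y, z = bloch
    return [[(1 + z) / 2, (x - 1j * y) / 2],
            [(x + 1j * y) / 2, (1 - z) / 2]]
```

Predicted expectation values that are slightly inconsistent (length greater than 1, for example because the three networks are trained independently) are mapped to the nearest pure state, while vectors already inside the ball are left unchanged.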

Discussion
We first note that, since in our approach the critical exponents are obtained from the temporal behavior of the learning efficiency at long times, the most important obstacle to obtaining more accurate critical exponents is the low efficiency of our learning algorithms for deep circuits in the area-law phase. Hence, an intriguing possibility is to find state-of-the-art neural network architectures that are more efficient in learning deep circuits with local data 37 . Similarly, implementing neural network decoders for other MIPTs, such as those in systems with long-range interactions 30 and symmetric MIPTs 59 , is an immediate extension of this work. As another main direction for future work, we note that from an experimental perspective it is possible to incorporate different errors, which are common in the realization of two-qubit gates and/or measurement processes, into our machine learning framework. Another intriguing question is whether it is possible to use our decoder approach for MIPTs that are not equivalent to purification transitions. In the context of quantum error correction and fault tolerance, the purification dynamics in measurement-induced phase transitions lead to a rich set of examples of dynamically generated quantum error-correcting codes 11,34,60,61 . Designing decoders similar to those considered here for other types of dynamically generated logical qubits is a rich avenue of investigation. We also highlight that our empirical complexity results raise interesting questions about the complexity of learning an effective Hamiltonian description 32,62,63 of the measurement outcome distributions for monitored quantum systems. Finally, improving our neural network algorithms to find the optimal decoder and investigating the applicability of unsupervised machine learning techniques to this problem are left for future studies 64,65 .

Quantum dynamics
The dynamics of the hybrid circuits considered in this work can in general be described using the quantum channel formalism. The wave function of the circuit, denoted by $|\psi_S\rangle$, is entangled with a reference qubit at the beginning of the time evolution. Formally, the time evolution of the system under this setting can be modeled using Kraus operators 66 ,
$$K_{\vec{m}} = P_T^{m_T} U_T \cdots P_2^{m_2} U_2\, P_1^{m_1} U_1,$$
where $m_t$, $U_t$ and $P_t^{m_t}$ denote the measurement outcomes, unitary gates, and projective measurements at the $t$th layer of the circuit, respectively. We also denote the set of all measurement outcomes in different layers by $\vec{m}$. The corresponding evolution of the density matrix, $\rho$, can be described via the following quantum channel:
$$\rho \to \sum_{\vec{m}} K_{\vec{m}}\, \rho\, K_{\vec{m}}^{\dagger}.$$
For our purpose, to generate the quantum trajectories we need to consider the time evolution of the system at the level of the wave functions. Under an arbitrary unitary operator $U$, the wave function evolves as
$$|\psi\rangle \to U|\psi\rangle.$$
For projective measurements, we consider a complete set of orthogonal projectors with eigenvalues labeled by $m$, satisfying $\sum_m P_t^m = 1$ and $P_t^m P_t^{m'} = \delta_{mm'} P_t^m$, under which the wave function evolves as
$$|\psi\rangle \to \frac{P_t^m |\psi\rangle}{\lVert P_t^m |\psi\rangle \rVert}.$$
In simulating the time evolution of the wave functions, we use random unitaries sampled from the Clifford group, under whose conjugation action the Pauli group is mapped to itself 67 . Such circuits, according to the Gottesman-Knill theorem, can be classically simulated in time polynomial in the system size 68,69 .
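The wave-function update under a projective Pauli $Z$ measurement can be illustrated with a small state-vector sketch. It is a brute-force illustration for a few qubits, not the stabilizer simulation used for Clifford circuits; the bit-ordering convention (qubit 0 is the most significant bit) and the function name `measure_z` are assumptions made for this example.

```python
import math

def measure_z(state, qubit, outcome, n_qubits):
    """Project `qubit` onto the Z eigenvalue `outcome` (+1 or -1) and renormalize.

    state: list of 2**n_qubits complex amplitudes in the computational basis.
    Returns (Born probability of the outcome, normalized post-measurement state).
    """
    bit = 0 if outcome == +1 else 1       # Z = +1 <-> |0>, Z = -1 <-> |1>
    shift = n_qubits - 1 - qubit          # qubit 0 is the leftmost bit
    projected = [amp if ((idx >> shift) & 1) == bit else 0j
                 for idx, amp in enumerate(state)]
    prob = sum(abs(amp) ** 2 for amp in projected)
    if prob == 0.0:
        raise ValueError("outcome has zero probability")
    norm = math.sqrt(prob)
    return prob, [amp / norm for amp in projected]
```

Applied to a Bell pair, measuring either qubit returns each outcome with probability 1/2 and leaves the pair in the corresponding product state, which is the elementary mechanism by which measurements purify the reference qubit.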

Implementation of deep learning algorithms
In this work, we mainly used convolutional neural networks for learning the decoder function. These networks are composed of several interconnected convolutional and pooling layers. The convolutional layer uses the locality of the input data to create new features from a linear combination of adjacent features through a convolution process. These layers are followed by pooling layers which reduce the number of features. Finally, a fully connected layer is used to associate a label to the newly generated features, thus classifying the data. These layers can be repeated a number of times for more complicated input data. Our neural network architecture symbolically displayed in Fig. 1b consists of eight layers whose hyperparameters are chosen by an empirical parametric search to optimize the learning accuracy when the number of samples are smaller than 5 × 10 4 . From left to right these layers include: (1) a convolutional layer with a L q /2 filters where L q is the number of qubits with a kernel size of 4 × 4, and a stride size of 1 × 1 with a rectified linear unit (ReLu) activation function, (2) a convolutional layer with a L q /2 filters where L q is the number of qubits with a kernel size of 3 × 3, and a stride size of 1 × 1 with a Relu activation function, (3) a maximum pooling layer with a window size of 2 × 2 to decrease the dimension of the input data, (4) a dropout layer with a dropping rate of r d = 0.2 to prevent overfitting, (5) a flattening layer to convert the data into a one-dimensional vector, (6) a dense fully connected layer with a Relu activation function whose number of output neurons is variable and is determined according to the number of training samples, N n = 512*(1 + 2⌊N t /2000⌋) where ⌊x⌋ denotes the floor function of x, (7) a dropout layer with a dropping rate of r d = 0.2, (8) a dense fully connected layer with a sigmoid activation function which generates the prediction for the spin of the reference qubit. 
Finally, since we have a classification problem, the loss function for comparing the predicted and actual labels is the binary cross-entropy. To train our neural network model with this loss function, we use the Adam optimization algorithm with a learning rate $l = 0.001$. The neural network layers and their optimization were implemented with the Python deep-learning packages TensorFlow and Keras.
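To make the layer list above concrete, the following minimal sketch tabulates the output shape of each of the eight layers and evaluates the width rule for the first dense layer. It is a bookkeeping aid in plain Python, not the TensorFlow/Keras implementation used in this work. Several choices are assumptions that the text does not specify: the measurement record is taken to be a two-dimensional array with a single channel, the convolutions are unpadded, and the pooling stride equals the pooling window. The function names are hypothetical.

```python
def dense_width(n_train):
    """Width of the first dense layer: N_n = 512 * (1 + 2 * floor(N_t / 2000))."""
    return 512 * (1 + 2 * (n_train // 2000))


def window_out(size, window, stride=1):
    """Output length of an unpadded convolution or pooling window (assumption)."""
    return (size - window) // stride + 1


def layer_shapes(height, width, n_qubits, n_train):
    """Return (layer name, output shape) for the eight layers described in the text.

    Assumed, not taken from the text: a height x width single-channel input,
    no padding, and a pooling stride equal to the pooling window.
    """
    filters = n_qubits // 2  # L_q / 2 filters in both convolutional layers
    h, w = window_out(height, 4), window_out(width, 4)
    shapes = [("conv 4x4 + ReLU", (h, w, filters))]
    h, w = window_out(h, 3), window_out(w, 3)
    shapes.append(("conv 3x3 + ReLU", (h, w, filters)))
    h, w = window_out(h, 2, 2), window_out(w, 2, 2)
    shapes.append(("max pool 2x2", (h, w, filters)))
    shapes.append(("dropout 0.2", (h, w, filters)))
    shapes.append(("flatten", (h * w * filters,)))
    n_dense = dense_width(n_train)
    shapes.append(("dense + ReLU", (n_dense,)))
    shapes.append(("dropout 0.2", (n_dense,)))
    shapes.append(("dense + sigmoid", (1,)))
    return shapes


if __name__ == "__main__":
    for name, shape in layer_shapes(16, 16, n_qubits=16, n_train=5000):
        print(f"{name:18s} {shape}")
```

The width rule keeps the dense layer at 512 neurons for fewer than 2000 training samples and adds 1024 neurons for every further 2000 samples.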

Scaling analysis and estimation of critical exponents
The critical exponents of this measurement-induced phase transition can be investigated from the decay rate of the reference qubit's entanglement entropy, denoted by $\lambda$, which has the scaling form $\lambda = L^{-z} F[(p - p_c) L^{z/\nu}]$ with a scaling function $F$ (see Eq. (3)). While in the main text we fixed $z = 1$ based on the assumption of conformal invariance, here we perform the analysis with $z$ allowed to vary. To find the best combination of the critical data that collapses our data according to this ansatz, we compare the normalized mean squared errors (NMSE), $\varepsilon_{\mathrm{NMSE}}$, such that the best fit is obtained when $\varepsilon_{\mathrm{NMSE}}^{-1}$ is maximized 70 . In particular, for a given $p_c$, $\nu$, and $z$, we first use cubic polynomials to find the regression curve of $y \equiv L^z \lambda$ as a function of $(p - p_c) L^{z/\nu}$, and then evaluate the mean squared error between $y$ and its best-fit value $\hat{y}$. To compare mean squared errors for different combinations of $(p_c, \nu, z)$, we have to normalize the data by defining dimensionless deviations. We then evaluate the NMSE for each combination of critical data according to $\varepsilon_{\mathrm{NMSE}} = \sum_i (\hat{y}_i - y_i)^2 / y_i^2$, where the summation runs over the data for all measurement rates and system sizes.
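The collapse procedure described above can be sketched as follows. For a trial $(p_c, \nu, z)$, the data are rescaled to $x = (p - p_c) L^{z/\nu}$ and $y = L^z \lambda$, a cubic polynomial is fitted by least squares, and the normalized error $\sum_i (\hat{y}_i - y_i)^2 / y_i^2$ is returned. This is a minimal illustration in plain Python rather than the code used for Fig. 6. The function names and the data layout, a list of $(L, p, \lambda)$ tuples, are assumptions, and the small linear solver stands in for a library routine such as `numpy.polyfit`.

```python
def cubic_fit(xs, ys):
    """Least-squares cubic through (xs, ys); returns coefficients c0..c3."""
    n = 4
    # Normal equations A c = b with A[j][k] = sum x^(j+k), b[j] = sum y x^j.
    A = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    # Back substitution on the upper-triangular system.
    c = [0.0] * n
    for j in reversed(range(n)):
        c[j] = (b[j] - sum(A[j][k] * c[k] for k in range(j + 1, n))) / A[j][j]
    return c


def nmse(data, p_c, nu, z):
    """Normalized squared error of the collapse for one trial (p_c, nu, z).

    data: iterable of (L, p, lam) tuples (hypothetical layout).
    """
    xs = [(p - p_c) * L ** (z / nu) for L, p, lam in data]
    ys = [L ** z * lam for L, p, lam in data]
    c = cubic_fit(xs, ys)
    fitted = [sum(ck * x ** k for k, ck in enumerate(c)) for x in xs]
    return sum((f - y) ** 2 / y ** 2 for f, y in zip(fitted, ys))
```

Scanning `nmse` over a grid of $(p_c, \nu, z)$ and locating the maximum of its inverse reproduces the logic of the analysis. Error bars on the critical data then follow from the region in which the inverse stays close to its maximum.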
The results of this analysis are displayed in Fig. 6, where we plot $\varepsilon_{\mathrm{NMSE}}^{-1}$ as a function of $\nu$ and $p_c$ for six different values of $z$ ranging from 0.75 to 1.25. From the subplots in this figure, we observe that the highest values of $\varepsilon_{\mathrm{NMSE}}^{-1}$ are obtained for $z \simeq 0.85$ to $0.95$, which is quite close to the value $z = 1$ expected from theoretical results based on conformal symmetry. Allowing $\varepsilon_{\mathrm{NMSE}}^{-1}$ to vary within about 10% of its maximum value, we obtain the following ranges for the best fits of the critical data: $p_c = 0.14 \pm 0.03$, $\nu = 1.5 \pm 0.3$, and $z = 0.9 \pm 0.15$.
Finally, we compare our results with those obtained directly from exact numerical simulations of Clifford circuits without employing our learning scheme. The decay rates obtained from such simulations for different system sizes are displayed in Fig. 7. The left subplot shows the results for the same system sizes as used in our learning simulations, where we observe a crossing of the curves at $p_c \simeq 0.13$, which supports the results obtained from the learning scheme. Furthermore, in the right subplot we observe that for larger system sizes the crossing of the curves is around $p_c \simeq 0.16$, which is very close to the results obtained from the half-chain entanglement entropy 6,8 . Accordingly, we expect that the estimates obtained from our learning scheme should improve upon increasing $L$ and $N_c$.

Key measurements in Clifford circuits
Consider a hybrid Clifford circuit $C$ which has $M$ Pauli measurements. Imagine applying this circuit to an initial stabilizer state that is entangled with a reference qubit. Assume that, as a result, the reference qubit disentangles and purifies into the state $|P; p_R\rangle$, where $P$ is one of the Pauli operators and $p_R = \pm 1$ determines which eigenvector of $P$ the reference qubit has been purified into. Let $s_1, \cdots, s_M = \pm 1$ denote the measurement outcomes for a single run of the circuit. If we run the same circuit again, the reference qubit will purify in the same basis $P$, but we may get a different $p_R$ as well as different $s_i$. The goal is to understand the relation between the value of $p_R$ and the measurement outcomes $\{s_i\}_{i=1}^{M}$.
When a Pauli string is measured on a stabilizer state, the result is either predetermined (if the Pauli string is already a member of the stabilizer group up to a phase) or it is ±1 with equal probability. We call the former determined measurements and the latter undetermined measurements. Note that in a stabilizer circuit, whether a measurement is determined or undetermined is independent of previous measurement outcomes. Therefore, for a given circuit $C$ and a fixed ordering of the measurements, it is well-defined to label each measurement as either determined or undetermined without referring to a specific circuit run.
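The distinction between determined and undetermined measurements can be illustrated with a short sketch. It assumes a pure stabilizer state on $n$ qubits specified by $n$ independent generators, written as Pauli strings without signs. Under that assumption a measured Pauli string that commutes with every generator belongs to the stabilizer group up to a sign, so its outcome is determined, while a string that anticommutes with at least one generator has a uniformly random outcome. The function names are hypothetical.

```python
def anticommute(p, q):
    """Two Pauli strings anticommute iff they carry different non-identity
    Paulis on an odd number of sites."""
    clashes = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return clashes % 2 == 1


def is_undetermined(measured, generators):
    """True if measuring `measured` gives +1 or -1 with equal probability.

    Assumes `generators` are n independent stabilizer generators of a pure
    n-qubit stabilizer state (signs omitted on purpose).
    """
    return any(anticommute(measured, g) for g in generators)


if __name__ == "__main__":
    # Bell pair between one system qubit and the reference qubit.
    bell = ["XX", "ZZ"]
    for pauli in ("ZI", "ZZ", "YY"):
        label = "undetermined" if is_undetermined(pauli, bell) else "determined"
        print(pauli, label)
```

The check never uses the signs of the generators. Earlier measurement outcomes only change those signs, which is why the label is independent of the circuit run.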
The following is a straightforward consequence of the Gottesman-Knill theorem: Corollary. There exists a unique subset of undetermined measurement results $\{s_{j_1}, \cdots, s_{j_m}\}$, which we call the key measurements, such that $p_R = c \, s_{j_1} s_{j_2} \cdots s_{j_m}$, where $c = \pm 1$ is the same for all circuit runs. Note that since key measurements are undetermined measurements, their values are independent of each other. Hence, to predict $p_R$ from undetermined measurement outcomes with any accuracy better than 1/2, one needs access to all key measurement results.
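The last statement can be checked by brute force in a toy setting. The sketch below assumes that the relation in the Corollary takes the parity form $p_R = c \, s_{j_1} \cdots s_{j_m}$ and that the $m$ key outcomes are independent and uniformly distributed. It enumerates all $2^m$ outcome strings and computes the accuracy of the best possible decoder that sees only a chosen subset of the key outcomes. The function name and arguments are hypothetical.

```python
from itertools import product


def best_accuracy(m, observed, c=1):
    """Best accuracy for predicting p_R = c * s_1 * ... * s_m when only the
    key outcomes at the indices in `observed` are available.

    Assumes all 2^m outcome strings are equally likely.
    """
    counts = {}  # observed record -> [number with p_R = +1, number with p_R = -1]
    for s in product((1, -1), repeat=m):
        p_r = c
        for x in s:
            p_r *= x
        record = tuple(s[i] for i in observed)
        counts.setdefault(record, [0, 0])[0 if p_r == 1 else 1] += 1
    # The optimal decoder outputs the more likely value for each record.
    return sum(max(pair) for pair in counts.values()) / 2 ** m


if __name__ == "__main__":
    print(best_accuracy(4, (0, 1, 2, 3)))  # all key measurements observed
    print(best_accuracy(4, (0, 1, 2)))     # one key measurement missing
```

With all four key outcomes the decoder is always right, whereas dropping a single one leaves both values of $p_R$ equally likely for every observed record.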
Each determined measurement can be seen as a constraint among previous undetermined measurement outcomes. Specifically, if $s_i$ is a determined measurement result for some $i$, there is some fixed $c' = \pm 1$ (independent of the circuit run) and a subset of undetermined measurements $\{s_{j'_1}, \cdots, s_{j'_m}\}$ such that $s_i = c' \, s_{j'_1} \cdots s_{j'_m}$. The similarity to the Corollary is not accidental: if the reference qubit is purified in the $P$ Pauli basis, then measuring it in the $P$ basis would be a determined measurement. The existence of these constraints means that if we relax the condition that the measurements be undetermined in the Corollary, the set of key measurements is no longer unique; we may be able to

Figure caption: Average of the minimum number of training samples required for learning the reference qubit's state when using circuits with scrambled initial states, plotted after conditioning on the purification time $t_p$, for $p = 0.1$ (mixed phase) and $p = 0.3$ (pure phase). Averaging is performed over $N_c = 20$ circuits for each $t_p$, and error bars are set according to the standard deviation. The circuits have $L = 128$ qubits. In the main plot, measurement outcomes from inside the fixed light-cone box are used for training, while for the inset we use the measurement outcomes from the whole circuit.

Relation between conditional and unconditional learning schemes
Here, we argue that, under certain conditions, the results of the two learning schemes displayed in Fig. 2 are related to each other. In particular, using the purification-time distribution of the circuits in Fig. 2a, the learnability $R_l(N_t)$ is related to the purification ratio $r_p(t_p)$. To make the analysis more transparent, in what follows we assume that the learning error is nearly vanishing, $\epsilon_l \simeq 0$. Next, we need the averaged learning efficiency of our decoder, which for a given $t_p$ and $N_t$ we denote by $\eta_l(t_p, N_t)$. This quantity is related to the averaged inaccuracy introduced in the text by $\eta_l = 1 - \delta_l$. To proceed, we employ a simplifying assumption that is approximately consistent with our numerical results: we imagine a decoder with a sharp step-like behavior of $\eta_l(t_p, N_t)$ as a function of $N_t$. Using the Heaviside theta function $\theta_H(x)$, we suppose $\eta_l(t_p, N_t) = \theta_H(N_t - M(t_p))$, where $M(t_p)$ is the minimum number of training samples needed to reach full efficiency for $t \le t_p$. From the definitions, it follows straightforwardly that $R_l(N_t)$ is fixed by the purification ratio $r_p$ and by $t_p^{\mathrm{Max}}(N_t)$, the maximum purification time that can be learned for a given $N_t$. This quantity can be evaluated by inverting the function $M(t)$ according to $t_p^{\mathrm{Max}}(N_t) = M^{-1}(N_t)$, where $M^{-1}(N_t)$ is the inverse function of $M(t_p)$. Now, we notice that $M(t_p)$, after averaging over different circuits, can be read off from the averaged minimum number of training samples in Fig. 2b. Therefore, by combining the information in Fig. 2a and b with $\eta_l(t_p, N_t)$, one can explain the behavior of $R_l(N_t)$ in Fig. 2d. Although we do not have the explicit form of $\eta_l(t_p, N_t)$, we use the step-like behavior as an approximation, which is justifiable given the exponential growth of the complexity with the purification time. Thus, using Eq. (12) as a plausible approximation for the learnability of our decoder, we expect that during the initial fast growth of the curves in Fig. 2a, the learned circuits are mostly those with short purification times. However, since an exponentially large number of training samples is required for longer purification times, the initial exponential growth is followed by a slow learning curve. Therefore, in Fig. 2d, we observe that deep in the pure phase, where the majority of circuits have a short purification time, $R_l$ asymptotically approaches one.
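The step-like argument above can be illustrated numerically. The sketch below assumes that, for $\epsilon_l \simeq 0$, the learnability equals the fraction of circuits whose purification time satisfies $M(t_p) \le N_t$, which is our reading of the step-function approximation rather than a formula quoted from the text. The exponential form of $M(t_p)$, its parameters `a` and `b`, and the purification-time fractions are hypothetical placeholders, not values extracted from Fig. 2.

```python
import math


def min_samples(t_p, a=50.0, b=0.8):
    """Hypothetical M(t_p) = a * exp(b * t_p); a and b are illustrative only."""
    return a * math.exp(b * t_p)


def learnability(n_train, purification_fraction):
    """Fraction of circuits learned with n_train samples under the step-like
    efficiency eta_l = theta_H(N_t - M(t_p)).

    purification_fraction: dict mapping t_p to the fraction of circuits with
    that purification time (hypothetical input).
    """
    return sum(frac for t_p, frac in purification_fraction.items()
               if n_train >= min_samples(t_p))


if __name__ == "__main__":
    # Illustrative distribution dominated by short purification times.
    fractions = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
    for n_train in (100, 300, 600, 2000):
        print(n_train, round(learnability(n_train, fractions), 3))
```

Because $M(t_p)$ grows exponentially, each additional purification time requires a multiplicative increase in $N_t$, so the learnability rises quickly at first and then saturates slowly.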

Complexity results for scrambled initial states
Here, we present our results for circuits whose initial states are scrambled by a high-depth random Clifford circuit. Concretely, to obtain such states, we first run our circuits on the initial product states with only two-qubit random Clifford gates and no measurements. This unitary time evolution creates a highly entangled state after $T \sim L$ time steps, with an entanglement entropy proportional to the system size. Next, we entangle the reference qubit with one of the circuit's qubits and run the same circuit in the presence of two-qubit gates and random measurements. As shown in ref. 11, there is a purification phase transition such that for $p < p_c$ the subsystem entanglement entropy of the circuit after $T \sim L$ still has a volume-law behavior, while for $p > p_c$ its entanglement entropy is negligible. The complexity results for such scrambled initial states are displayed in Fig. 8. Here, as in Fig. 4, we observe a nearly exponential dependence on the purification time. Furthermore, we notice that the conditional learning scheme is more difficult in the pure phase than in the mixed phase. By comparing the inset and the main plot, we also observe that learning with the light-cone data requires fewer training samples. By comparing Figs. 2b and 8, we observe that learning circuits with scrambled initial conditions requires more training samples than learning circuits with product-state initial conditions. Finally, we present further results for the system-size dependence of the sampling complexity of our approach in Fig. 9, where we use only the light-cone measurement outcomes. The x-axis represents the system size, $L \in \{16, 32, 64, 128\}$. Different curves represent different purification times, $t_p \in \{1, \cdots, 6\}$. The left panel shows our results for $p = 0.3$, corresponding to the area-law phase, and the right panel shows our results for the volume-law phase with $p = 0.1$.
After taking the error bars into account, our results suggest a nearly system-size-independent behavior. However, since the NN decoder that we have employed for these simulations is not necessarily the optimal decoder, we expect some deviation from an exactly system-size-independent behavior. When the system size is changed by a factor of 8, the sample complexity increases by a factor of 2 on average and by a factor of 4 in the tails. More definitive results would require larger ensembles of circuits (larger $N_c$) and larger system sizes, which is beyond the scope of this work.

Data availability
Source data for figures in the main text are provided with this paper. Data that support the plots within this paper and other findings of this study are generated and protected by the Extreme Science and Engineering Discovery Environment (XSEDE), at the Pittsburgh Supercomputing Center and are available from the corresponding author upon request.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.